Introduction to Open Data Science - Course Project

About the project

I have learned about the course Introduction to Open Data Science through my PhD studies. I’m exited to start the course but also a bit worried about the workload. The course Moodle page seems to contain a lot of information, so probably will take a while to get a handle on things.

I have previously used R and Rstudio on some courses, but I’m no expert. I think that exercise 1 was too long, and all in all there was too much stuff to be learned about R in the first week. To me, this one really long exercise feels exhausting. It is nice that there are the R codes already embedded, but I feel smaller chunks would be better. If I hadn’t done R before, I would be terrified right now.

Link to my GitHub repository


Regression and model validation

date()
## [1] "Mon Dec 12 15:37:47 2022"

Basic information about dataset

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5      v purrr   0.3.4 
## v tibble  3.1.2      v dplyr   1.0.10
## v tidyr   1.1.3      v stringr 1.4.0 
## v readr   1.4.0      v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
students2014 <- read_csv("./data/learning2014.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   gender = col_character(),
##   age = col_double(),
##   attitude = col_double(),
##   deep = col_double(),
##   stra = col_double(),
##   surf = col_double(),
##   points = col_double()
## )
dim(students2014)
## [1] 166   7
str(students2014)
## spc_tbl_ [166 x 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )

Dataset is created from data: https://www.mv.helsinki.fi/home/kvehkala/JYTmooc/JYTOPKYS3-meta.txt
Data is collected from students in course ‘Johdatus yhteiskuntatilastotieteeseen’ in fall 2014.

Dataset students2014 contains 7 variables with 166 observations: gender, age, attitude, deep, stra, surf, and points.

All variables in the original data related to strategic, deep, and surface learning have been combined into three variables (stra, deep, and surf, respectively), and scaled to original scales by taking the mean.

Attitude tells about student’s global attitude towards statistics. Variable has been scaled back to the original scale of the questions, by dividing it with the number of questions.

Points tells the course exam points. Observations with points equaling 0 have been excluded, since those students didn’t attend the course exam.

Describing the data

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a plot matrix with ggpairs()
p <- ggpairs(students2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

summary(factor(students2014$gender))
##   F   M 
## 110  56
summary(students2014$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   21.00   22.00   25.51   27.00   55.00
summary(students2014$attitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   2.600   3.200   3.143   3.700   5.000
summary(students2014$deep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   3.333   3.667   3.680   4.083   4.917
summary(students2014$stra)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   2.625   3.188   3.121   3.625   5.000
summary(students2014$surf)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.583   2.417   2.833   2.787   3.167   4.333
summary(students2014$points)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   19.00   23.00   22.72   27.75   33.00

There are 66% (110/166) women in respondents. The median age is 22 years (range 17-55), the median age of women seems to be a bit lower than that of men.

The median attitude is 3.2 (range 1.4-5). The median attitude of men seems higher than that of women.

The median for deep learning is 3.7 (range 1.6-4.9). The median for strategic learning is 3.1 (range 1.3-5). The median for surface learning is 2.8 (range 1.6-4.3). For women, the median for strategic and surface learning seems a bit higher than for men.

The median for exam points is 23 (range 7-33). The median is about the same in men and women.

Attitude correlates with exam points, so that with “better” attitude towards statistics one usually gets higher exam points.

Surface learning correlates negatively with deep and strategic learning as well as attitude. So with better attitude one less often utilizes surface learning. If one utilizes surface learning, one less often utilizes deep or strategic learning.

Interpreting the fitted regression model

#fit a linear regression model using points as the outcome and attitude, strategig and surface learning as explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
#remove surface learning from the model, since it's so far from significant
my_model2 <- lm(points ~ attitude + stra, data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
#error rate
sigma(my_model2)*100/mean(students2014$points)
## [1] 23.28148

First we choose three variables that correlate the most with points in above visualization (namely attitude, strategic and surface learning). However, after fitting the model, we see that for surface learning there is a really high p-value (0.466), so we remove that and fit the model again with just attitude and strategic learning.

In the new model, p-value of the F-statistic is 7.734e-09. So at least one of the explanatory variables is significantly associated with exam points.

For attitude, p-value is <0.001, so it is significantly associated with exam points. According to this model, if attitude increases by 1, exam points increase by approximately 3.5.

For strategic learning, p-value is 0.089. It isn’t <0.05 that’s usually deemed significant, however, it is quite close. So we probably don’t want to state that strategic learning has no effect on exam points. On the other hand, the estimated intercept for strategic learning is 0.9, so a change in attitude makes a much bigger difference than a change in strategic learning.

The adjusted R-squared is 0.1951, which means that attitude (and strategic learning) only explain around 20% of the variance in exam points.

The residual standard error is 5.289 which corresponds to 23% prediction error rate.

Assumptions of the model

plot(my_model2, which=c(1,2,5))

Linear regression assumes a linear relationship between predictors and outcome, that residual errors are normally distributed, that residuals have a constant variance, and independence of residual error terms.

With Residuals vs. Fitted plot we can check the linearity of data. Here, the red line seems to be horizontal at approximately zero, where it should be. So we can assume a linear relationship.

With plot Normal Q-Q we can check if residual errors are normally distributed. The plot of residuals should more or less follow the dashed line. Here they mostly do so, with some exceptions at the upper and lower ends. However, we may assume normality.

The Residuals vs. Leverage plot marks three most extreme points (35, 71, 145). Two of them exceed 3 standard deviations, so they are possible outliers.


Logistic regression

date()
## [1] "Mon Dec 12 15:38:00 2022"

Short description of dataset

library(tidyverse)
alc <- read_csv("./data/alc.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   age = col_double(),
##   Medu = col_double(),
##   Fedu = col_double(),
##   traveltime = col_double(),
##   studytime = col_double(),
##   famrel = col_double(),
##   freetime = col_double(),
##   goout = col_double(),
##   Dalc = col_double(),
##   Walc = col_double(),
##   health = col_double(),
##   failures = col_double(),
##   absences = col_double(),
##   G1 = col_double(),
##   G2 = col_double(),
##   G3 = col_double(),
##   alc_use = col_double(),
##   high_use = col_logical()
## )
## i Use `spec()` for the full column specifications.
dim(alc)
## [1] 370  35
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Dataset is created from data: https://archive.ics.uci.edu/ml/datasets/Student+Performance
The data approaches student achievement in secondary education of two Portuguese schools, and the dataset joins two student alcohol consumption datasets. The variables not used for joining the two data have been combined by averaging (including the grade variables). ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’. ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise.

Dataset alc contains 35 variables with 370 observations.

My personal hypothesis about relationships of with alcohol consumption

I think interesting variables in relation to alcohol consumption are sex, number of past class failures, final grade (G3) and number of school absences.
My hypothesis for these variables are:
1. Men’s alcohol consumption is higher than women’s.
2. Those who consume more alcohol have more past class failures compared to those who consume less alcohol.
3. Those who consume more alcohol have lower final grades compared to those who consume less alcohol.
4. Those who consume more alcohol have more school absences compared to those who consume less alcohol.

Description of chosen variables and their relationships with alcohol consumption

library(dplyr); library(ggplot2); library(GGally); library(finalfit)


dependent <- "high_use"
explanatory <- c("sex", "failures", "G3", "absences")
alc %>% 
  summary_factorlist(dependent, explanatory, p = TRUE,
                     add_dependent_label = TRUE)
## Warning in chisq.test(failures, high_use): Chi-squared approximation may be
## incorrect
##  Dependent: high_use                FALSE       TRUE      p
##                  sex         F 154 (59.5)  41 (36.9) <0.001
##                              M 105 (40.5)  70 (63.1)       
##             failures         0 238 (91.9)  87 (78.4)  0.003
##                              1   12 (4.6)  12 (10.8)       
##                              2    8 (3.1)    9 (8.1)       
##                              3    1 (0.4)    3 (2.7)       
##                   G3 Mean (SD) 11.8 (3.4) 10.9 (3.0)  0.011
##             absences Mean (SD)  3.7 (4.5)  6.4 (7.1) <0.001
#median and quartiles of the numeric variables
summary(alc$failures)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1892  0.0000  3.0000
summary(alc$G3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.00   12.00   11.52   14.00   18.00
summary(alc$absences)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   4.511   6.000  45.000
#draw a barplot of alcohol consumption by sex
ggplot(data = alc, aes(x = high_use, fill=sex)) + geom_bar()

#produce summary statistics of final grade by alcohol consumption and sex
alc %>% group_by(sex, high_use) %>% summarise(mean_grade=mean(G3),count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use mean_grade count
##   <chr> <lgl>         <dbl> <int>
## 1 F     FALSE          11.4   154
## 2 F     TRUE           11.8    41
## 3 M     FALSE          12.3   105
## 4 M     TRUE           10.3    70
#draw a boxplot of final grade by alcohol consumption and sex
ggplot(alc, aes(x = high_use, y = G3, col=sex)) + geom_boxplot() + ylab("grade")

#produce summary statistics of school abscences by alcohol consumption and sex
alc %>% group_by(sex, high_use) %>% summarise(mean_absences=mean(absences),count = n())
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 x 4
## # Groups:   sex [2]
##   sex   high_use mean_absences count
##   <chr> <lgl>            <dbl> <int>
## 1 F     FALSE             4.25   154
## 2 F     TRUE              6.85    41
## 3 M     FALSE             2.91   105
## 4 M     TRUE              6.1     70
#draw a boxplot of school absences by alcohol consumption and sex
ggplot(alc, aes(x = high_use, y = absences, col=sex)) + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

There are 195 women and 175 men. Of those, 21% (41/195) and 40% (70/175) have high alcohol consumption, respectively. This supports my hypothesis of men consuming more alcohol than women.

The mean of past class failures is 0.2 (range 0-3). Of high users, 3% (3/111) have 3 failures, 8% (9/111) have 2 failures, and 11% have 1 failure. Of others, 0.4% (1/259) have 3 failures, 3% (8/259) have 2 failures, and 5% have 1 failure. Thus it would seem that students with high alcohol consumption have failed class more often than those with low alcohol consumption.

The mean final grade is 11.5 (range 0-18). Men with high alcohol consumption seem to have lower final grades than people with low alcohol consumption. Interestingly, women with high alcohol consumption seem to have around the same final grades than people with low alcohol consumption. So my hypothesis is supported only by observations of men.

The mean number of absences is 4.5 (range 0-45, median 3). The median of school absences is higher for men with high alcohol consumption than other groups. However, if we look at the mean, women with high alcohol consumption have the highest mean of absences, and both men and women with high alcohol consumption have higher mean than those with low alcohol consumption. This is somewhat in line with my hypothesis.

Logistic regression

# fit a logistic regression model 
my_model <- glm(high_use ~ sex + failures + absences + G3, data = alc, family = "binomial")

# print out a summary of the model
summary(my_model)
## 
## Call:
## glm(formula = high_use ~ sex + failures + absences + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1561  -0.8429  -0.5872   1.0033   2.1393  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.38733    0.51617  -2.688  0.00719 ** 
## sexM         1.00870    0.24798   4.068 4.75e-05 ***
## failures     0.50382    0.22018   2.288  0.02213 *  
## absences     0.09058    0.02322   3.901 9.56e-05 ***
## G3          -0.04671    0.03948  -1.183  0.23671    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 405.59  on 365  degrees of freedom
## AIC: 415.59
## 
## Number of Fisher Scoring iterations: 4
# fit a new model without G3 model 
my_model2 <- glm(high_use ~ sex + failures + absences, data = alc, family = "binomial")

# print out a summary of the model
summary(my_model2)
## 
## Call:
## glm(formula = high_use ~ sex + failures + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1550  -0.8430  -0.5889   1.0328   2.0374  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.94150    0.23129  -8.394  < 2e-16 ***
## sexM         0.99731    0.24725   4.034 5.49e-05 ***
## failures     0.59759    0.20698   2.887  0.00389 ** 
## absences     0.09245    0.02323   3.979 6.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 406.99  on 366  degrees of freedom
## AIC: 414.99
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(my_model2) %>% exp

# compute confidence intervals (CI)
CI <- confint(my_model2) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1434892 0.08940913 0.2219242
## sexM        2.7109881 1.67967829 4.4367282
## failures    1.8177381 1.21997630 2.7642305
## absences    1.0968598 1.05041937 1.1508026

In the first model, final grade has a p-value 0.2, and thus isn’t associated with high alcohol consumption. Let’s fit the model again without it. In the second model, all variables are associated with the outcome, and also the AIC drops from 415.59 to 414.99, which indicates that G3 can be dropped from the model.

According to this model, high alcohol consumption is associated with being male (OR 2.7, 95% CI 1.7-4.4, p-value <0.001), having failed class (OR 1.8, 95% CI 1.2-2.8, p-value 0.004), and school absences (OR 1.1, 95% CI 1.1-1.2, p-value <0.001).
My primary hypothesis were correct regarding being male and having failed classes. However, there is only slight association with school absence (OR 1.1), and the final grade doesn’t seem to be associated with high alcohol consumption.

Predictive power of the model

# predict() the probability of high_use
probabilities <- predict(my_model2, type = "response")

library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252    7
##    TRUE     78   33
# draw a plot of 'high_use' versus 'probability' in 'alc'
ggplot(alc, aes(x = probability, y = high_use)) + geom_point()

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2297297

In the 2x2 table we can see that 252 low and 33 high alcohol consumers are predicted correctly. In the predictions, however, there are 78 false negatives and 7 false positives. According to the training error, 23% of the individuals are classified inaccurately. This is better than tossing a coin (where the chances would be 50/50), but I guess with some background information one could get quite close with educated guesses.
## 10-fold cross-validation of the model

# 10-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = my_model2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2378378
#Let's try the same with another model. 
my_model3 <- glm(high_use ~ traveltime + sex + absences +studytime + goout, data = alc, family = "binomial")
summary(my_model3)
## 
## Call:
## glm(formula = high_use ~ traveltime + sex + absences + studytime + 
##     goout, family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8435  -0.7908  -0.4914   0.6936   2.5211  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.72342    0.67103  -5.549 2.88e-08 ***
## traveltime   0.35147    0.18157   1.936 0.052908 .  
## sexM         0.81643    0.27225   2.999 0.002710 ** 
## absences     0.07919    0.02293   3.454 0.000552 ***
## studytime   -0.40974    0.17327  -2.365 0.018041 *  
## goout        0.71830    0.12152   5.911 3.40e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 363.51  on 364  degrees of freedom
## AIC: 375.51
## 
## Number of Fisher Scoring iterations: 4
prob2 <- predict(my_model3, type = "response")
loss_func(class = alc$high_use, prob = prob2)
## [1] 0.2027027
cv2 <- cv.glm(data = alc, cost = loss_func, glmfit = my_model3, K = 10)
cv2$delta[1]
## [1] 0.2162162

The prediction error of the model when using 10-fold cross-validation is around 0.24, which is slightly less than in the exercise set.
With the other tested model (explanatory variables being sex, home to school travel time, number of school absences, weekly study time, and going out with friends), the prediction error with 10-fold cross-validation is around 0.22, which is still lower.


Clustering and classification

date()
## [1] "Mon Dec 12 15:38:05 2022"

Short description of dataset

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# load the data 
data("Boston")

# check structure and dimensions
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The dataset contains housing values in suburbs of Boston. It can be downloaded from R’s MASS package, and contains 506 observations of 14 variables. More information on the data and variables can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html

Graphical overview of the data and summaries of the variables

library(tidyr)
# check summary of data
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# draw pairs plot of the variables
pairs(Boston)

# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
cor_matrix %>% round(2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor_matrix, method="circle")

The summary of the data shows means, medians and ranges of the different variables, for example median number of rooms per dwelling is 6.2 (range 3.6-8.8).

Tax rate and accessibility to radial highways seem to be strongly positively correlated. Median value and lower status of the population are negatively correlated. The plotted correlation matrix shows also other positive and negative correlations. The darker blue a sphere is, the stronger the positive correlation between variables, and the darker red a sphere is, the stronger the negative correlation.

Standardizing the dataset

library(dplyr)
# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

boston_scaled$crim <- as.numeric(boston_scaled$crim)

# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, label = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

The variables are now transformed on the same scale, so now we can compare them.

Linear discriminant analysis and predictions

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2648515 0.2524752 0.2326733 0.2500000 
## 
## Group means:
##                  zn      indus          chas        nox         rm        age
## low       0.9793983 -0.9445242 -0.1251477505 -0.8943020  0.4421935 -0.8937512
## med_low  -0.1185809 -0.2392569 -0.0021359143 -0.5286081 -0.1231768 -0.3238176
## med_high -0.4215561  0.2549072  0.2721635181  0.3353375  0.1473120  0.4441279
## high     -0.4872402  1.0171306  0.0005392655  1.0384627 -0.3684256  0.8230764
##                 dis        rad        tax     ptratio      black       lstat
## low       0.8963825 -0.6974378 -0.7163000 -0.47028916  0.3798894 -0.77181353
## med_low   0.2871340 -0.5461293 -0.4620609  0.01102945  0.3108629 -0.14698541
## med_high -0.3531785 -0.3929764 -0.2601417 -0.16766260  0.1362980 -0.03713606
## high     -0.8618847  1.6379981  1.5139626  0.78062517 -0.6814148  0.91542900
##                   medv
## low       0.5179299172
## med_low   0.0006968892
## med_high  0.1617255201
## high     -0.6798453345
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.09998939  0.698285806 -0.75223286
## indus    0.09336963 -0.523837286  0.33337708
## chas    -0.07966935 -0.070932758  0.07242003
## nox      0.43557982 -0.586733465 -1.39242377
## rm      -0.06089623 -0.142188564 -0.23835664
## age      0.24986481 -0.352631944 -0.36116086
## dis     -0.01489872 -0.389990560 -0.08959452
## rad      3.21035901  0.852216938  0.02206920
## tax     -0.08075876  0.146695062  0.37109290
## ptratio  0.11901459 -0.007267877 -0.07593604
## black   -0.12004946  0.005044376  0.14052733
## lstat    0.25844057 -0.182895080  0.38140068
## medv     0.20886452 -0.288977347 -0.18669411
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9515 0.0374 0.0112
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15       3        2    0
##   med_low    4      19        1    0
##   med_high   1      12       18    1
##   high       0       0        0   26

It would seem that with high crime rate the predictions are most accurate. With low to medium and medium to high there is most inaccuracy.

Distance measures and k-means clustering

library(ggplot2)
data(Boston)
boston_scaled2 <- as.data.frame(scale(Boston))

# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled2, centers = 3)

# plot part of the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)

# determine the maximum number of clusters
k_max <- 10

# calculate the total within sum of squares
set.seed(123)
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled2, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

# plot part of the Boston dataset with clusters
pairs(boston_scaled2[6:10], col = km$cluster)

From plotting the total within sum of squares we see that around two the value changes quite a lot, so the appropriate number of cluster would be two. The data seems to divide nicely between two clusters according to most variables.

Bonus

data(Boston)
boston_scaled3 <- as.data.frame(scale(Boston))

# k-means clustering
set.seed(123)
km2 <- kmeans(boston_scaled3, centers = 3)

# linear discriminant analysis
lda.fit2 <- lda(km2$cluster ~., data = boston_scaled3)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(km2$cluster)

# plot the lda results
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit2, myscale = 4)

Variables age, black, and tax seem to be the most influential linear separators for the clusters.

Super-Bonus

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
lda.fit3 <- lda(crime ~., data = train)
model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit3$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit3$scaling
matrix_product <- as.data.frame(matrix_product)

classes <- as.numeric(train$crime)
train2 <- dplyr::select(train, -crime)
set.seed(123)
km3 <- kmeans(train2, centers = 4)
clusters <- as.numeric(km3$cluster)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=classes)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=clusters)

The cluster that is on the left-hand side seems quite similar in both plots and is quite well defined. In the second plot the clusters are more defined, in the first plot the rest of the clusters mix more with each other.


Dimensionality reduction techniques

date()
## [1] "Mon Dec 12 15:38:26 2022"

Short description of dataset

human <- read.table("./data/human.txt", header = TRUE)
dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

The data originates from the United Nations Development Programme
Original data from: http://hdr.undp.org/en/content/human-development-index-hdi

Meta file: https://hdr.undp.org/data-center/human-development-index#/indicies/HDI
Technical notes: https://hdr.undp.org/system/files/documents//technical-notes-calculating-human-development-indices.pdf

Dataset contains 155 observations of 8 variables.

Variable names:
“GNI” = Gross National Income per capita
“Life.Exp” = Life expectancy at birth
“Edu.Exp” = Expected years of schooling
“Mat.Mor” = Maternal mortality ratio
“Ado.Birth” = Adolescent birth rate
“Parli.F” = Percetange of female representatives in parliament
“Edu2.FM” = Proportion of females with at least secondary education / Proportion of males with at least secondary education
“Labo.FM” = Proportion of females in the labour force / Proportion of males in the labour force

Rownames are the countries.

Graphical overview of the data and summaries of the variables

library(GGally)
library(corrplot)
# check summary of data
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# draw pairs plot of the variables
ggpairs(human)

# calculate and visualize the correlation matrix
cor(human) %>% corrplot(., method="circle")

Median life expectancy at birth is 74 years (49-84 years). Median expected years of schooling is 13.5 years (5-20 years). Median gross national income per capita (GNI) is 12 040 (581-123 124). Median maternal mortality ratio is 49 (1-1 100). Median adolescent birth rate is 47 (1-205). Median percentage of female representatives in parliament is 1% (0-58%).

Expected years of schooling correlates positively with life expectancy, GNI, and percentage of female representatives in parliament. It correlates negatively with maternal mortality ratio and adolescent birth rate. Life expectancy also correlates positively with GNI, and negatively with maternal mortality ratio and adolescent birth rate. Maternal mortality ratio correlates negatively with GNI and adolescent birth rate. These findings make a lot of sense.

Principal component analysis (PCA)

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Principal component analysis (PCA) with standardized variables

# standardize the variables
human_std <- scale(human)

# perform principal component analysis (with the SVD method)
pca_human_std <- prcomp(human_std)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"))
text(-8, 9, "developed countries", col="blue")
text(10, 3, "developing countries", col="blue")
text(0, -9, "Arab countries", col="blue")

# create and print out a summary of pca_human_std
s <- summary(pca_human_std)

# rounded percentanges of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3

The results are different, because in the non-standardized data the variables have different scales. Therefore the variances in variables in the non-standardized data are higher. That is also why the first PCA biplot is dominated by GNI (widest scale).

According to the biplot of the standardized dataset, African countries (so-called developing countries) seem to have higher adolescent birt rate and maternal mortality ratio. Many European and other western countries, as well as Australia and Singapore, have higher life expectancy and expected years of schooling, as well as higher proportion of women in parliament. These countries can also be descriped as welfare states or developed countries. In the Arab countries, the proportion of women in parliament is low, as well as proportion of women working compared to men.

The first two principal components of the standardized dataset explain 53.6% and 16.2% of variability, respectively. PC1 is positively correlated with Mat.Mor and Ado.Birth, and negatively correlated with Edu.Exp, Life.Exp, GNI, and Edu2.FM. PC2 is positively correlated with Parli.F and Labo.FM.

# Work with the exercise in this chunk, step-by-step. Fix the R code!
# pca_human, dplyr are available

# create and print out a summary of pca_human
s <- summary(pca_human_std)
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
# rounded percentanges of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human_std, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

Multiple Correspondence Analysis (MCA) on the tea data

library(dplyr)
library(tidyr)
library(ggplot2)
library(FactoMineR)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

# Check the structure and dimensions of data
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
# Browse data
View(tea)

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "friends", "age_Q", "tearoom", "tea.time")

# select the 'keep_columns' to create a new dataset
new_tea <- select(tea, all_of(keep_columns))

# Check the summaries and structure of the data
str(new_tea)
## 'data.frame':    300 obs. of  8 variables:
##  $ Tea     : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How     : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how     : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar   : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ age_Q   : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ tea.time: Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
summary(new_tea)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                                                                     
##         friends      age_Q           tearoom            tea.time  
##  friends    :196   +60  :38   Not.tearoom:242   Not.tea time:131  
##  Not.friends:104   15-24:92   tearoom    : 58   tea time    :169  
##                    25-34:69                                       
##                    35-44:40                                       
##                    45-59:61
# visualize the dataset
pivot_longer(new_tea, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free")+geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

# multiple correspondence analysis
mca <- MCA(new_tea, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = new_tea, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.231   0.204   0.162   0.155   0.136   0.131   0.129
## % of var.             12.319  10.893   8.653   8.279   7.244   6.976   6.906
## Cumulative % of var.  12.319  23.212  31.865  40.144  47.389  54.364  61.270
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.117   0.113   0.101   0.094   0.093   0.076   0.068
## % of var.              6.246   6.005   5.403   5.032   4.973   4.077   3.651
## Cumulative % of var.  67.516  73.521  78.925  83.957  88.930  93.006  96.657
##                       Dim.15
## Variance               0.063
## % of var.              3.343
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.143  0.029  0.011 |  0.598  0.583  0.186 | -0.449
## 2                  |  0.131  0.025  0.009 |  0.962  1.509  0.467 | -0.438
## 3                  |  0.123  0.022  0.015 |  0.056  0.005  0.003 | -0.067
## 4                  | -0.766  0.847  0.546 |  0.085  0.012  0.007 | -0.353
## 5                  | -0.251  0.091  0.050 |  0.634  0.656  0.318 | -0.083
## 6                  | -0.519  0.389  0.255 |  0.248  0.101  0.058 | -0.266
## 7                  |  0.153  0.034  0.017 | -0.097  0.015  0.007 | -0.279
## 8                  |  0.440  0.279  0.086 |  0.611  0.609  0.167 | -0.736
## 9                  |  0.280  0.113  0.037 |  0.108  0.019  0.006 | -0.334
## 10                 |  0.661  0.630  0.170 |  0.288  0.135  0.032 |  0.087
##                       ctr   cos2  
## 1                   0.414  0.105 |
## 2                   0.394  0.097 |
## 3                   0.009  0.004 |
## 4                   0.256  0.116 |
## 5                   0.014  0.005 |
## 6                   0.146  0.067 |
## 7                   0.160  0.058 |
## 8                   1.112  0.242 |
## 9                   0.229  0.053 |
## 10                  0.016  0.003 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.921  11.324   0.278   9.113 |   0.596   5.368   0.116
## Earl Grey          |  -0.324   3.665   0.190  -7.535 |  -0.412   6.698   0.307
## green              |  -0.168   0.167   0.003  -1.019 |   1.075   7.782   0.143
## alone              |  -0.141   0.695   0.037  -3.313 |   0.060   0.141   0.007
## lemon              |   0.038   0.008   0.000   0.229 |  -0.700   3.295   0.060
## milk               |   0.082   0.077   0.002   0.735 |   0.234   0.706   0.015
## other              |   2.331   8.820   0.168   7.088 |  -0.368   0.248   0.004
## tea bag            |  -0.308   2.910   0.124  -6.091 |   0.201   1.396   0.053
## tea bag+unpackaged |   0.322   1.758   0.047   3.761 |  -0.609   7.117   0.169
## unpackaged         |   0.614   2.446   0.051   3.919 |   0.643   3.038   0.056
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                5.900 |  -0.287   1.567   0.027  -2.841 |
## Earl Grey           -9.579 |  -0.070   0.242   0.009  -1.621 |
## green                6.536 |   1.052   9.383   0.137   6.397 |
## alone                1.405 |   0.212   2.260   0.084   5.006 |
## lemon               -4.253 |   0.514   2.243   0.033   3.128 |
## milk                 2.090 |  -0.715   8.273   0.136  -6.375 |
## other               -1.119 |  -1.484   5.087   0.068  -4.511 |
## tea bag              3.968 |  -0.552  13.324   0.399 -10.924 |
## tea bag+unpackaged  -7.116 |   0.526   6.690   0.126   6.149 |
## unpackaged           4.107 |   1.234  14.082   0.208   7.881 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.280 0.324 0.145 |
## How                | 0.177 0.072 0.232 |
## how                | 0.131 0.189 0.443 |
## sugar              | 0.225 0.087 0.019 |
## friends            | 0.030 0.428 0.024 |
## age_Q              | 0.366 0.281 0.395 |
## tearoom            | 0.356 0.126 0.021 |
## tea.time           | 0.281 0.127 0.019 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_mca_var(mca, choice = "mca.cor", 
            repel = TRUE, 
            ggtheme = theme_minimal())

For this exercise, tea data from FactoMineR package is used. 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions).

There are 300 observations of 36 variables, of which we select 8 variables.

Agegroup +60 cluster with drinking black tea. Drinking green tea clusters with drinking tea without friends. Teabag tea is drank not in a tearoom and without additives. Earl Grey is drank with sugar.

Variables agegroup and tearoom are most correlated with dimension 1. Friends is most correlates with dimension 2.


Analysis of longitudinal data

date()
## [1] "Mon Dec 12 15:38:46 2022"
library(dplyr)
library(tidyr)
library(ggplot2)
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
library(ggpubr)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car

Analysis of Longitudinal Data I: Graphical Displays and Summary Measure Approach

RATSL <- read.table("./data/RATSL.txt", header=TRUE)
# Factor variables ID and Group
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)

# Glimpse the data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,~
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, ~
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", ~
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470~
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, ~
# Check the column names
names(RATSL)
## [1] "ID"     "Group"  "WD"     "Weight" "Time"
# Look at the structure of dataset
str(RATSL)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
# Print out summaries of the variables
summary(RATSL)
##        ID      Group       WD                Weight           Time      
##  1      : 11   1:88   Length:176         Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   Class :character   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   Mode  :character   Median :344.5   Median :36.00  
##  4      : 11                             Mean   :384.5   Mean   :33.55  
##  5      : 11                             3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11                             Max.   :628.0   Max.   :64.00  
##  (Other):110

The dataset RATSL is a long format version of data from a nutrition study conducted in three groups of rats. The groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately) weekly, except in week seven when two recordings were taken) over a 9-week period.

The dataset has 176 rows and 5 columns (ID, group, weight, and day in borth character and integer formats).

Plot the RATSL data

# Plot the data
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
  geom_line(aes(linetype=Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weight (grams)") + 
  theme(legend.position = "top")

Summary graph

# Calculate the weightgain
RATSL <- 
  RATSL %>%
  group_by(ID) %>%
  mutate(Weightgain = Weight - Weight[Time == 1] ) %>%
  ungroup()

# Plot the weightgain
ggplot(RATSL, aes(x = Time, y = Weightgain, group = ID)) +
  geom_line(aes(linetype=Group)) +
  scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
  scale_y_continuous(name = "Weightgain (grams)") + 
  theme(legend.position = "top")

# Summary data with mean and standard error of weightgain by group and time 
RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weightgain), se = sd(Weightgain) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups`
## argument.
# Glimpse the data
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36, ~
## $ mean  <dbl> 0.000, 4.375, 3.750, 11.250, 14.000, 14.375, 16.750, 16.625, 18.~
## $ se    <dbl> 0.000000, 4.955156, 10.264363, 8.988087, 8.734169, 7.308263, 7.7~
# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  geom_point(size=3) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(Weightgain) +/- se(Weightgain)")

Based on the graph, group 2 seems to have gained most weight in the study period.

Checking for outliers

# Draw a boxplot of the mean weightgain versus group
ggplot(RATSL, aes(x = Group, y = Weightgain)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean weightgain, days 1-64")

No outliers detected.

T-test and Anova

# Pairwise t-test comparisons
pwc <- RATSL %>%
  pairwise_t_test(Weightgain ~ Group, p.adjust.method = "bonferroni")
pwc
## # A tibble: 3 x 9
##   .y.        group1 group2    n1    n2            p p.signif      p.adj p.adj.~1
## * <chr>      <chr>  <chr>  <int> <int>        <dbl> <chr>         <dbl> <chr>   
## 1 Weightgain 1      2         88    44 0.0000000434 ****     0.00000013 ****    
## 2 Weightgain 1      3         88    44 0.206        ns       0.619      ns      
## 3 Weightgain 2      3         44    44 0.000157     ***      0.000471   ***     
## # ... with abbreviated variable name 1: p.adj.signif
# Visualization: box plots with p-values
pwc <- pwc %>% add_xy_position(x = "Group", step.increase = 1)
ggboxplot(RATSL, x = "Group", y = "Weightgain") +
  stat_pvalue_manual(pwc, hide.ns = TRUE) 

# Fit the linear model with the mean as the response 
fit <- lm(Weightgain~Group, data = RATSL)
summary(fit)
## 
## Call:
## lm(formula = Weightgain ~ Group, data = RATSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.955 -13.091   1.909   8.920  60.045 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   13.091      1.800   7.275 1.16e-11 ***
## Group2        17.864      3.117   5.731 4.34e-08 ***
## Group3         3.955      3.117   1.269    0.206    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.88 on 173 degrees of freedom
## Multiple R-squared:  0.1615, Adjusted R-squared:  0.1518 
## F-statistic: 16.66 on 2 and 173 DF,  p-value: 2.425e-07
# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: Weightgain
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## Group       2   9493  4746.3  16.656 2.425e-07 ***
## Residuals 173  49299   285.0                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the pairwise t-tests and anova, there seems to be a significant difference in weight gain between groups 1 and 2, and 2 and 3.

Analysis of Longitudinal Data II: Linear Mixed Effects Models for Normal Response Variables

BPRSL <- read.table("./data/BPRSL.txt", header=TRUE)
# Factor treatment & subject
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)

# Glimpse the data
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0~
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, ~
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
# Check the column names
names(BPRSL)
## [1] "treatment" "subject"   "weeks"     "bprs"      "week"
# Look at the structure of dataset
str(BPRSL)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
# Print out summaries of the variables
summary(BPRSL)
##  treatment    subject       weeks                bprs            week  
##  1:180     1      : 18   Length:360         Min.   :18.00   Min.   :0  
##  2:180     2      : 18   Class :character   1st Qu.:27.00   1st Qu.:2  
##            3      : 18   Mode  :character   Median :35.00   Median :4  
##            4      : 18                      Mean   :37.66   Mean   :4  
##            5      : 18                      3rd Qu.:43.00   3rd Qu.:6  
##            6      : 18                      Max.   :95.00   Max.   :8  
##            (Other):252

The dataset BPRSL is a long format version of BPRS, in which 40 male subjects were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe). The scale is used to evaluate patients suspected of having schizophrenia.

The dataset has 360 rows and 5 columns (treatment, subject, bprs, and week in both character and integer formats).

I ran out of time for the interpretations of this second part.

Plot the BPRLS data

# Plot the data
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))  

The linear model

# create a regression model BPRS_reg
BPRS_reg <- lm(bprs~week+treatment, data=BPRSL)

# print out a summary of the model
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

The random intercept model

# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)

# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Random intercept and random slope model

# create a random intercept and random slope model
BPPS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPPS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# perform an ANOVA test on the two models
anova(BPPS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPPS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPPS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Random intercept and random slope model with interaction

# create a random intercept and random slope model with the interaction
BPPS_ref2 <- lmer(bprs ~ week + treatment + (week | subject) + week*treatment, data = BPRSL, REML = FALSE)

# print a summary of the model
summary(BPPS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + week * treatment
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPPS_ref2, BPPS_ref1)
## Data: BPRSL
## Models:
## BPPS_ref1: bprs ~ week + treatment + (week | subject)
## BPPS_ref2: bprs ~ week + treatment + (week | subject) + week * treatment
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPPS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPPS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# draw the plot of BPRSL with the observed bprs values
ggplot(BPRSL, aes(x = week, y = bprs, group = subject)) +
  geom_line() +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "top")

# Create a vector of the fitted values
Fitted <- fitted(BPPS_ref2)

# Create a new column fitted to RATSL
BPRSL$Fitted <- Fitted

# draw the plot of RATSL with the Fitted values of weight
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject)) +
  geom_line() +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "top")